IEICE global.ieice.org Site

Author Search Result

[Author] Hideharu AMANO(66hit)

21-40hit(66hit)

An FPGA-Based Optimizer Design for Distributed Deep Learning with Multiple GPUs
Tomoya ITSUBO Michihiro KOIBUCHI Hideharu AMANO Hiroki MATSUTANI

PAPER

Pubricized:
2021/07/01
Vol:
E104-D No:12
Page(s):
2057-2067
Since deep learning workloads perform a large number of matrix operations on training data, GPUs (Graphics Processing Units) are efficient especially for the training phase. A cluster of computers each of which equips multiple GPUs can significantly accelerate the deep learning workloads. More specifically, a back-propagation algorithm following a gradient descent approach is used for the training. Although the gradient computation is still a major bottleneck of the training, gradient aggregation and optimization impose both communication and computation overheads, which should also be reduced for further shortening the training time. To address this issue, in this paper, multiple GPUs are interconnected with a PCI Express (PCIe) over 10Gbit Ethernet (10GbE) technology. Since these remote GPUs are interconnected with network switches, gradient aggregation and optimizers (e.g., SGD, AdaGrad, Adam, and SMORMS3) are offloaded to FPGA-based 10GbE switches between remote GPUs; thus, the gradient aggregation and parameter optimization are completed in the network. The proposed FPGA-based 10GbE switches with the four optimizers are implemented on NetFPGA-SUME board. Their resource utilizations are increased by PEs for the optimizers, and they consume up to 56% of the resources. Evaluation results using four remote GPUs connected via the proposed FPGA-based switch demonstrate that these optimizers are accelerated by up to 3.0x and 1.25x compared to CPU and GPU implementations, respectively. Also, the gradient aggregation throughput by the FPGA-based switch achieves up to 98.3% of the 10GbE line rate.
A Multi-Tenant Resource Management System for Multi-FPGA Systems
Miho YAMAKURA Ryousei TAKANO Akram BEN AHMED Midori SUGAYA Hideharu AMANO

PAPER

Pubricized:
2021/10/08
Vol:
E104-D No:12
Page(s):
2078-2088
FPGA (Field Programmable Gate Array) based accelerators are attracting significant interest in cloud computing systems. Combining multi-FPGA systems with cloud computing brings a new perspective to the reconfigurable computing research. However, the multi-tenancy of a multi-FPGA system has not been fully discussed in the previous researches. In this paper, we propose a multi-tenant resource management system, named FiC-RM, for a multi-FPGA cloud system. FiC-RM provides users with a set of FPGA resources according to their requirements and allows them to exclusively access FPGA boards and the interconnection network. To achieve this, we propose a placement algorithm which is a key to efficiently share the limited resources. We demonstrate FiC-RM controls a practical scale multi-FPGA system. Moreover, Our simulation study shows that our placement algorithm achieved 3 to 4% improvement in the average resource usage and a 20-second reduction in the response time, compared to other existing naive algorithms.
Boosting the Performance of Interconnection Networks by Selective Data Compression
Naoya NIWA Hideharu AMANO Michihiro KOIBUCHI

PAPER

Pubricized:
2022/07/12
Vol:
E105-D No:12
Page(s):
2057-2065
This study presents a selective data-compression interconnection network to boost its performance. Data compression virtually increases the effective network bandwidth. One drawback of data compression is a long latency to perform (de-)compression operation at a compute node. In terms of the communication latency, we explore the trade-off between the compression latency overhead and the reduced injection latency by shortening the packet length by compression algorithms. As a result, we present to selectively apply a compression technique to a packet. We perform a compression operation to long packets and it is also taken when network congestion is detected at a source compute node. Through a cycle-accurate network simulation, the selective compression method using the above compression algorithms improves by up to 39% the network throughput with a moderate increase in the communication latency of short packets.
A Perpetuum Mobile 32bit CPU on 65nm SOTB CMOS Technology with Reverse-Body-Bias Assisted Sleep Mode
Koichiro ISHIBASHI Nobuyuki SUGII Shiro KAMOHARA Kimiyoshi USAMI Hideharu AMANO Kazutoshi KOBAYASHI Cong-Kha PHAM

PAPER

Vol:
E98-C No:7
Page(s):
536-543
A 32bit CPU, which can operate more than 15 years with 220mAH Li battery, or eternally operate with an energy harvester of in-door light is presented. The CPU was fabricated by using 65nm SOTB CMOS technology (Silicon on Thin Buried oxide) where gate length is 60nm and BOX layer thickness is 10nm. The threshold voltage was designed to be as low as 0.19V so that the CPU operates at over threshold region, even at lower supply voltages down to 0.22V. Large reverse body bias up to -2.5V can be applied to bodies of SOTB devices without increasing gate induced drain leak current to reduce the sleep current of the CPU. It operated at 14MHz and 0.35V with the lowest energy of 13.4 pJ/cycle. The sleep current of 0.14µA at 0.35V with the body bias voltage of -2.5V was obtained. These characteristics are suitable for such new applications as energy harvesting sensor network systems, and long lasting wearable computers.
Remote Dynamic Reconfiguration of a Multi-FPGA System FiC (Flow-in-Cloud)
Kazuei HIRONAKA Kensuke IIZUKA Miho YAMAKURA Akram BEN AHMED Hideharu AMANO

PAPER-Computer System

Pubricized:
2021/05/12
Vol:
E104-D No:8
Page(s):
1321-1331
Multi-FPGA systems have been receiving a lot of attention as a low cost and energy efficient system for Multi-access Edge Computing (MEC). For such purpose, a bare-metal multi-FPGA system called FiC (Flow-in-Cloud) is under development. In this paper, we introduce the FiC multi FPGA cluster which is applied partial reconfiguration (PR) FPGA design flow to support online user defined accelerator replacement while executing FPGA interconnection network and its low-level multiple FPGA management software called remote PR manager. With the remote PR manager, the user can define the FiC FPGA cluster setup by JSON and control the cluster from user application with the cooperation of simple cluster management tool / library called ficmgr on the client host and REST API service provider called ficwww on Raspberry Pi 3 (RPi3) on each node. According to the evaluation results with a prototype FiC FPGA cluster system with 12 nodes, using with online application replacement by PR and on-the-fly FPGA bitstream compression, the time for FPGA bitstream distribution was reduced to 1/17 and the total cluster setup time was reduced by 21∼57% than compared to cluster setup with full configuration FPGA bitstream.
FiC-RNN: A Multi-FPGA Acceleration Framework for Deep Recurrent Neural Networks
Yuxi SUN Hideharu AMANO

PAPER-Computer System

Pubricized:
2020/09/24
Vol:
E103-D No:12
Page(s):
2457-2462
Recurrent neural networks (RNNs) have been proven effective for sequence-based tasks thanks to their capability to process temporal information. In real-world systems, deep RNNs are more widely used to solve complicated tasks such as large-scale speech recognition and machine translation. However, the implementation of deep RNNs on traditional hardware platforms is inefficient due to long-range temporal dependence and irregular computation patterns within RNNs. This inefficiency manifests itself in the proportional increase in the latency of RNN inference with respect to the number of layers of deep RNNs on CPUs and GPUs. Previous work has focused mostly on optimizing and accelerating individual RNN cells. To make deep RNN inference fast and efficient, we propose an accelerator based on a multi-FPGA platform called Flow-in-Cloud (FiC). In this work, we show that the parallelism provided by the multi-FPGA system can be taken advantage of to scale up the inference of deep RNNs, by partitioning a large model onto several FPGAs, so that the latency stays close to constant with respect to increasing number of RNN layers. For single-layer and four-layer RNNs, our implementation achieves 31x and 61x speedup compared with an Intel CPU.
Iterative Synthesis Methods Estimating Programmable-Wire Congestion in a Dynamically Reconfigurable Processor
Takao TOI Takumi OKAMOTO Toru AWASHIMA Kazutoshi WAKABAYASHI Hideharu AMANO

PAPER-High-Level Synthesis and System-Level Design

Vol:
E94-A No:12
Page(s):
2619-2627
Iterative synthesis methods for making aware of wire congestion are proposed for a multi-context dynamically reconfigurable processor (DRP) with a large number of processing elements (PEs) and programmable-wire connections. Although complex data-paths can be synthesized using the programmable-wire, its delay is long especially when wire connections are congested. We propose two iterative synthesis techniques between a high-level synthesizer (HLS) and the place & route tool to shorten the prolonged wire delay. First, we feed back wire delays for each context to a scheduler in the HLS. The experimental results showed that a critical-path delay was shorten by 21% on average for applications with timing closure problems. Second, we skip the routing and estimate wire delays based on the congestion. The synthesis time was shorten to 1/3 causing delay improvement rate degradation at two points on average.
Fine-Grained Run-Tume Power Gating through Co-optimization of Circuit, Architecture, and System Software Design Open Access
Hiroshi NAKAMURA Weihan WANG Yuya OHTA Kimiyoshi USAMI Hideharu AMANO Masaaki KONDO Mitaro NAMIKI

INVITED PAPER

Vol:
E96-C No:4
Page(s):
404-412
Power consumption has recently emerged as a first class design constraint in system LSI designs. Specially, leakage power has occupied a large part of the total power consumption. Therefore, reduction of leakage power is indispensable for efficient design of high-performance system LSIs. Since 2006, we have carried out a research project called “Innovative Power Control for Ultra Low-Power and High-Performance System LSIs”, supported by Japan Science and Technology Agency as a CREST research program. One of the major objectives of this project is reducing the leakage power consumption of system LSIs by innovative power control through tight cooperation and co-optimization of circuit technology, architecture, and system software designs. In this project, we focused on power gating as a circuit technique for reducing leakage power. Temporal granularity is one of the most important issue in power gating. Thus, we have developed a series of Geysers as proof-of-concept CPUs which provide several mechanisms of fine-grained run-time power gating. In this paper, we describe their concept and design, and explain why co-optimization of different design layers are important. Then, three kinds of power gating implementations and their evaluation are presented from the view point of power saving and temporal granularity.
A Novel Channel Assignment Method to Ensure Deadlock-Freedom for Deterministic Routing
Ryuta KAWANO Hiroshi NAKAHARA Seiichi TADE Ikki FUJIWARA Hiroki MATSUTANI Michihiro KOIBUCHI Hideharu AMANO

PAPER-Computer System

Pubricized:
2017/05/19
Vol:
E100-D No:8
Page(s):
1798-1806
Inter-switch networks for HPC systems and data-centers can be improved by applying random shortcut topologies with a reduced number of hops. With minimal routing in such networks; however, deadlock-freedom is not guaranteed. Multiple Virtual Channels (VCs) are efficiently used to avoid this problem. However, previous works do not provide good trade-offs between the number of required VCs and the time and memory complexities of an algorithm. In this work, a novel and fast algorithm, named ACRO, is proposed to endorse the arbitrary routing functions with deadlock-freedom, as well as consuming a small number of VCs. A heuristic approach to reduce VCs is achieved with a hash table, which improves the scalability of the algorithm compared with our previous work. Moreover, experimental results show that ACRO can reduce the average number of VCs by up to 63% when compared with a conventional algorithm that has the same time complexity. Furthermore, ACRO reduces the time complexity by a factor of O(|N|⋅log|N|), when compared with another conventional algorithm that requires almost the same number of VCs.
A Layout-Oriented Routing Method for Low-Latency HPC Networks
Ryuta KAWANO Hiroshi NAKAHARA Ikki FUJIWARA Hiroki MATSUTANI Michihiro KOIBUCHI Hideharu AMANO

PAPER-Interconnection networks

Pubricized:
2017/07/14
Vol:
E100-D No:12
Page(s):
2796-2807
End-to-end network latency has become an important issue for parallel application on large-scale high performance computing (HPC) systems. It has been reported that randomly-connected inter-switch networks can lower the end-to-end network latency. This latency reduction is established in exchange for a large amount of routing information. That is, minimal routing on irregular networks is achieved by using routing tables for all destinations in the networks. In this work, a novel distributed routing method called LOREN (Layout-Oriented Routing with Entries for Neighbors) to achieve low-latency with a small routing table is proposed for irregular networks whose link length is limited. The routing tables contain both physically and topologically nearby neighbor nodes to ensure livelock-freedom and a small number of hops between nodes. Experimental results show that LOREN reduces the average latencies by 5.8% and improves the network throughput by up to 62% compared with a conventional compact routing method. Moreover, the number of required routing table entries is reduced by up to 91%, which improves scalability and flexibility for implementation.
Body Bias Domain Partitioning Size Exploration for a Coarse Grained Reconfigurable Accelerator
Yusuke MATSUSHITA Hayate OKUHARA Koichiro MASUYAMA Yu FUJITA Ryuta KAWANO Hideharu AMANO

PAPER-Architecture

Pubricized:
2017/07/14
Vol:
E100-D No:12
Page(s):
2828-2836
Body biasing can be used to control the leakage power and performance by changing the threshold voltage of transistors after fabrication. Especially, a new process called Silicon-On-Thin Box (SOTB) CMOS can control their balance widely. When it is applied to a Coarse Grained Reconfigurable Array (CGRA), the leakage power can be much reduced by precise bias control with small domain size including a small number of PEs. On the other hand, the area overhead for separating power domain and delivering a lot of wires for body bias voltage supply increases. This paper explores the grain of domain size of an energy efficient CGRA called CMA (Cool Mega Array). By using Genetic Algorithm based body bias assignment method, the leakage reduction of various grain size was evaluated. As a result, a domain with 2x1 PEs achieved about 40% power reduction with a 6% area overhead. It has appeared that a combination of three body bias voltages; zero bias, weak reverse bias and strong reverse bias can achieve the optimal leakage reduction and area overhead balance in most cases.
Analysis of Body Bias Control Using Overhead Conditions for Real Time Systems: A Practical Approach
Carlos Cesar CORTES TORRES Hayate OKUHARA Nobuyuki YAMASAKI Hideharu AMANO

PAPER-Computer System

Pubricized:
2018/01/12
Vol:
E101-D No:4
Page(s):
1116-1125
In the past decade, real-time systems (RTSs), which must maintain time constraints to avoid catastrophic consequences, have been widely introduced into various embedded systems and Internet of Things (IoTs). The RTSs are required to be energy efficient as they are used in embedded devices in which battery life is important. In this study, we investigated the RTS energy efficiency by analyzing the ability of body bias (BB) in providing a satisfying tradeoff between performance and energy. We propose a practical and realistic model that includes the BB energy and timing overhead in addition to idle region analysis. This study was conducted using accurate parameters extracted from a real chip using silicon on thin box (SOTB) technology. By using the BB control based on the proposed model, about 34% energy reduction was achieved.
Data Multicasting Procedure for Increasing Configuration Speed of Coarse Grain Reconfigurable Devices
Vasutan TUNBUNHENG Masayasu SUZUKI Hideharu AMANO

PAPER-Computer Systems

Vol:
E90-D No:2
Page(s):
473-481
A novel configuration method called Row Multicast Configuration (RoMultiC) is proposed for high speed configuration of coarse grain reconfigurable systems. The same configuration data can be transferred in multicast fashion to configure many Processing Elements (PEs) by using a multicast bit-map provided in row and column directions of PE array. Evaluation results using practical applications show that a model reconfigurable system that incorporates this scheme can reduce configuration clock cycles by up to 73.1% compared with traditional configuration delivery scheme. Amount of required memory to store the configuration data at external memory is also reduced by omitting the duplicated configuration data.
Optimization of Body Biasing for Variable Pipelined Coarse-Grained Reconfigurable Architectures
Takuya KOJIMA Naoki ANDO Hayate OKUHARA Ng. Anh Vu DOAN Hideharu AMANO

PAPER-Computer System

Pubricized:
2018/03/09
Vol:
E101-D No:6
Page(s):
1532-1540
Variable Pipeline Cool Mega Array (VPCMA) is a low power Coarse Grained Reconfigurable Architecture (CGRA) based on the concept of CMA (Cool Mega Array). It provides a pipeline structure in the PE array that can be configured so as to fit target algorithms and required performance. Also, VPCMA uses the Silicon On Thin Buried oxide (SOTB) technology, a type of Fully Depleted Silicon On Insulator (FDSOI), so it is possible to control its body bias voltage to provide a balance between performance and leakage power. In this paper, we study the optimization of the VPCMA body bias while considering simultaneously its variable pipeline structure. Through evaluations, we can observe that it is possible to achieve an average reduction of energy consumption, for the studied applications, of 17.75% and 10.49% when compared to respectively the zero bias (without body bias control) and the uniform (control of the whole PE array) cases, while respecting performance constraints. Besides, it is observed that, with appropriate body bias control, it is possible to extend the possible performance, hence enabling broader trade-off analyzes between consumption and performance. Considering the dynamic power as well as the static power, more appropriate pipeline structure and body bias voltage can be obtained. In addition, when the control of VDD is integrated, higher performance can be achieved with a steady increase of the power. These promising results show that applying an adequate optimization technique for the body bias control while simultaneously considering pipeline structures can not only enable further power reduction than previous methods, but also allow more trade-off analysis possibilities.
CLAHE Implementation and Evaluation on a Low-End FPGA Board by High-Level Synthesis
Koki HONDA Kaijie WEI Masatoshi ARAI Hideharu AMANO

PAPER

Pubricized:
2021/07/12
Vol:
E104-D No:12
Page(s):
2048-2056
Automobile companies have been trying to replace side mirrors of cars with small cameras for reducing air resistance. It enables us to apply some image processing to improve the quality of the image. Contrast Limited Adaptive Histogram Equalization (CLAHE) is one of such techniques to improve the quality of the image for the side mirror camera, which requires a large computation performance. Here, an implementation method of CLAHE on a low-end FPGA board by high-level synthesis is proposed. CLAHE has two main processing parts: cumulative distribution function (CDF) generation, and bilinear interpolation. During the CDF generation, the effect of increasing loop initiation interval can be greatly reduced by placing multiple Processing Elements (PEs). and during the interpolation, latency and BRAM usage were reduced by revising how to hold CDF and calculation method. Finally, by connecting each module with streaming interfaces, using data flow pragmas, overlapping processing, and hiding data transfer, our HLS implementation achieved a comparable result to that of HDL. We parameterized the components of the algorithm so that the number of tiles and the size of the image can be easily changed. The source code for this research can be downloaded from https://github.com/kokihonda/fpga_clahe.
A Compression Router for Low-Latency Network-on-Chip
Naoya NIWA Yoshiya SHIKAMA Hideharu AMANO Michihiro KOIBUCHI

PAPER-Computer System

Pubricized:
2022/11/08
Vol:
E106-D No:2
Page(s):
170-180
Network-on-Chips (NoCs) are important components for scalable many-core processors. Because the performance of parallel applications is usually sensitive to the latency of NoCs, reducing it is a primary requirement. In this study, a compression router that hides the (de)compression-operation delay is proposed. The compression router (de)compresses the contents of the incoming packet before the switch arbitration is completed, thus shortening the packet length without latency penalty and reducing the network injection-and-ejection latency. Evaluation results show that the compression router improves up to 33% of the parallel application performance (conjugate gradients (CG), fast Fourier transform (FT), integer sort (IS), and traveling salesman problem (TSP)) and 63% of the effective network throughput by 1.8 compression ratio on NoC. The cost is an increase in router area and its energy consumption by 0.22mm2 and 1.6 times compared to the conventional virtual-channel router. Another finding is that off-loading the decompressor onto a network interface decreases the compression-router area by 57% at the expense of the moderate increase in communication latency.
FOREWORD Open Access
HIDEHARU AMANO

FOREWORD

Vol:
E96-D No:12
Page(s):
2513-2513
Novel Chip Stacking Methods to Extend Both Horizontally and Vertically for Many-Core Architectures with ThrouChip Interface
Hiroshi NAKAHARA Tomoya OZAKI Hiroki MATSUTANI Michihiro KOIBUCHI Hideharu AMANO

PAPER-Architecture

Pubricized:
2016/08/24
Vol:
E99-D No:12
Page(s):
2871-2880
The increase of recent non-recurrent engineering cost (design, mask and test cost) have made large System-on-Chip (SoC) difficult to develop especially with advanced technology. We radically explore an approach for cheap and flexible chip stacking by using Inductive coupling ThruChip Interface (TCI). In order to connect a large number of small chips for building a large scale system, novel chip stacking methods called the linear stacking and staggered stacking are proposed. They enable the system to be extended to x or/and y dimensions, not only to z dimension. Here, a novel chip staking layout, and its deadlock-free routing design for the case using single-core chips and multi-core chips are shown. The network with 256 nodes formed by the proposed stacking improves the latency of 2D mesh by 13.8% and the performance of NAS Parallel Benchmarks by 5.4% on average compared to that of 2D mesh.
The Compatible Acknowledging Ethernet
Toshitada SAITO Mario TOKORO Hideharu AMANO

PAPER-Switching and Communication Processing

Vol:
E70-E No:10
Page(s):
960-967
Ethernet is a scheme of local computer networking, which has been widely utilized. Acknowledging Ethernet insures immediate acknowledgement by providing an acknowledgement mechanism in the MAC (Media Access Control) layer protocol. Although Acknowledging Ethernet gives both higher reliability and higher performance, this scheme is not compatible with DIX or IEEE 802 Ethernet. In this paper, a modified scheme which is compatible with the original Ethernet, called the Compatible Acknowledging Ethernet (CAE), is proposed. The CAE scheme allows sharing of the same cable with Ethernet. CAE can be implemented by adding a small amount of logic to the Network Interface Unit of the original Ethernet.
A Leakage Efficient Instruction TLB Design for Embedded Processors
Zhao LEI Hui XU Daisuke IKEBUCHI Tetsuya SUNATA Mitaro NAMIKI Hideharu AMANO

PAPER-Computer System

Vol:
E94-D No:8
Page(s):
1565-1574
This paper presents a leakage-efficient instruction TLB (Translation Lookaside Buffer) design for embedded processors. The key observation is that when programs enter a physical page, the following instructions tend to be fetched from the same page for a rather long time. Thus, by employing a small storage component which holds the recent address-translation information, the TLB access frequency can be drastically decreased, and the instruction TLB can be turned into the low-leakage mode with the dual voltage supply technique. Based on such a design philosophy, three leakage control policies are proposed to maximize the leakage reduction efficiency. Evaluation results with eight MiBench programs show that the proposed design can reduce the leakage power of the instruction TLB by 50% on average, with only 0.01% performance degradation.

21-40hit(66hit)

Author Search Result

[Author] Hideharu AMANO(66hit)

An FPGA-Based Optimizer Design for Distributed Deep Learning with Multiple GPUs

A Multi-Tenant Resource Management System for Multi-FPGA Systems

Boosting the Performance of Interconnection Networks by Selective Data Compression

A Perpetuum Mobile 32bit CPU on 65nm SOTB CMOS Technology with Reverse-Body-Bias Assisted Sleep Mode

Remote Dynamic Reconfiguration of a Multi-FPGA System FiC (Flow-in-Cloud)

FiC-RNN: A Multi-FPGA Acceleration Framework for Deep Recurrent Neural Networks

Iterative Synthesis Methods Estimating Programmable-Wire Congestion in a Dynamically Reconfigurable Processor

Fine-Grained Run-Tume Power Gating through Co-optimization of Circuit, Architecture, and System Software Design Open Access

A Novel Channel Assignment Method to Ensure Deadlock-Freedom for Deterministic Routing

A Layout-Oriented Routing Method for Low-Latency HPC Networks

Body Bias Domain Partitioning Size Exploration for a Coarse Grained Reconfigurable Accelerator

Analysis of Body Bias Control Using Overhead Conditions for Real Time Systems: A Practical Approach

Data Multicasting Procedure for Increasing Configuration Speed of Coarse Grain Reconfigurable Devices

Optimization of Body Biasing for Variable Pipelined Coarse-Grained Reconfigurable Architectures

CLAHE Implementation and Evaluation on a Low-End FPGA Board by High-Level Synthesis

A Compression Router for Low-Latency Network-on-Chip

FOREWORD Open Access

Novel Chip Stacking Methods to Extend Both Horizontally and Vertically for Many-Core Architectures with ThrouChip Interface

The Compatible Acknowledging Ethernet

A Leakage Efficient Instruction TLB Design for Embedded Processors

Latest Issue

Links

Call for Papers

Submit to IEICE Trans.

Transactions NEWS

Popular articles